 continuous integration


An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains

Yan, Zihe, Luo, Kai, Yang, Haoyu, Yu, Yang, Zhang, Zhuosheng, Li, Guancheng

arXiv.org Artificial Intelligence

In modern software development workflows, the open-source software supply chain significantly contributes to efficient and convenient engineering practices. With increasing system complexity, it has become a common practice to use open-source software as third-party dependencies. However, due to the lack of maintenance for underlying dependencies and insufficient community auditing, ensuring the security of source code and the legitimacy of repository maintainers has become a challenge, particularly in the context of high-stealth backdoor attacks such as the XZ Utils incident. To address these problems, we propose a fine-grained project evaluation framework for backdoor risk assessment in open-source software. Our evaluation framework models highly stealthy backdoor attacks from the attacker's perspective and defines targeted metrics for each attack stage. Moreover, to overcome the limitations of static analysis in assessing the reliability of repository maintenance activities, such as irregular committer privilege escalation and insufficient review participation, we employ large language models (LLMs) to perform semantic evaluation of code repositories while avoiding reliance on manually crafted patterns. The effectiveness of our framework is validated on 66 high-priority packages in the Debian ecosystem, and the experimental results reveal that the current open-source software supply chain is exposed to a series of security risks.


Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration

Arani, Ali Kazemi, Le, Triet Huynh Minh, Zahedi, Mansooreh, Babar, M. Ali

arXiv.org Artificial Intelligence

Context: Machine learning (ML) and deep learning (DL) analyze raw data to extract valuable insights in specific phases. The rise of continuous practices in software projects emphasizes automating Continuous Integration (CI) with these learning-based methods, while the growing adoption of such approaches underscores the need for systematizing knowledge. Objective: Our objective is to comprehensively review and analyze existing literature concerning learning-based methods within the CI domain. We endeavour to identify and analyse various techniques documented in the literature, emphasizing the fundamental attributes of training phases within learning-based solutions in the context of CI. Method: We conducted a Systematic Literature Review (SLR) involving 52 primary studies. Through statistical and thematic analyses, we explored the correlations between CI tasks and the training phases of learning-based methodologies across the selected studies, encompassing a spectrum from data engineering techniques to evaluation metrics. Results: This paper presents an analysis of the automation of CI tasks utilizing learning-based methods. We identify and analyze nine types of data sources, four steps in data preparation, four feature types, nine subsets of data features, five approaches for hyperparameter selection and tuning, and fifteen evaluation metrics. Furthermore, we discuss the latest techniques employed, existing gaps in CI task automation, and the characteristics of the utilized learning-based techniques. Conclusion: This study provides a comprehensive overview of learning-based methods in CI, offering valuable insights for researchers and practitioners developing CI task automation. It also highlights the need for further research to advance these methods in CI.


Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning

Liang, Penghao, Song, Bo, Zhan, Xiaoan, Chen, Zhou, Yuan, Jiaqiang

arXiv.org Artificial Intelligence

This article introduces the importance of machine learning in real-world applications and explores the rise of MLOps (Machine Learning Operations) and its importance for solving challenges such as model deployment and performance monitoring. By reviewing the evolution of MLOps and its relationship to traditional software development methods, the paper proposes ways to integrate the system into machine learning to solve the problems faced by existing MLOps and improve productivity. The paper focuses on the importance of automated model training and on ensuring the transparency and repeatability of the training process through a version control system. In addition, the challenges of integrating machine learning components into traditional CI/CD pipelines are discussed, and solutions such as versioning environments and containerization are proposed. Finally, the paper emphasizes the importance of continuous monitoring and feedback loops after model deployment to maintain model performance and reliability. Using case studies and best practices from Netflix, the article presents key strategies and lessons learned for successful implementation of MLOps practices, providing valuable references for other organizations to build and optimize their own MLOps practices.


Systematic Literature Review on Application of Machine Learning in Continuous Integration

Arani, Ali Kazemi, Le, Triet Huynh Minh, Zahedi, Mansooreh, Babar, Muhammad Ali

arXiv.org Artificial Intelligence

This research conducted a systematic review of the literature on machine learning (ML)-based methods in the context of Continuous Integration (CI) over the past 22 years. The study aimed to identify and describe the techniques used in ML-based solutions for CI and analyzed various aspects such as data engineering, feature engineering, hyper-parameter tuning, ML models, evaluation methods, and metrics. In this paper, we have depicted the phases of CI testing, the connection between them, and the techniques employed in the training phases of the ML methods. We presented nine types of data sources and four steps taken in the selected studies for preparing the data. Also, we identified four feature types and nine subsets of data features through thematic analysis of the selected studies. Besides, five methods for selecting and tuning the hyper-parameters are shown. In addition, we summarised the evaluation methods used in the literature and identified fifteen different metrics. The most commonly used evaluation metrics were found to be precision, recall, and F1-score, and we have also identified five methods for evaluating the performance of trained ML models. Finally, we have presented the relationship between ML model types, performance measurements, and CI phases. The study provides valuable insights for researchers and practitioners interested in ML-based methods in CI and emphasizes the need for further research in this area.
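
As an illustration of the three measures named in the abstract above (precision, recall, and F1-score), here is a minimal sketch that computes them from raw predictions. The function name `precision_recall_f1` and the build-outcome labels are hypothetical and chosen only for this example; they are not taken from the reviewed studies.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight CI builds: 1 = build failed, 0 = build passed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy data there are three true positives, one false positive, and one false negative, so all three measures come out to 0.75.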


Build Reliable Machine Learning Pipelines with Continuous Integration

#artificialintelligence

As a data scientist, you are responsible for improving the model currently in production. After spending months fine-tuning the model, you discover one with greater accuracy than the original. Excited by your breakthrough, you create a pull request to merge your model into the main branch. Unfortunately, because of the numerous changes, your team takes over a week to evaluate and analyze them, which ultimately impedes project progress. Furthermore, after deploying the model, you identify unexpected behaviors resulting from code errors, causing the company to lose money.


SoK: Machine Learning for Continuous Integration

Arani, Ali Kazemi, Zahedi, Mansooreh, Le, Triet Huynh Minh, Babar, Muhammad Ali

arXiv.org Artificial Intelligence

Abstract: Continuous Integration (CI) has become a well-established software development practice for automatically and continuously integrating code changes during software development. An increasing number of Machine Learning (ML) based approaches for automation of CI phases are being reported in the literature. It is timely and relevant to provide a Systemization of Knowledge (SoK) of ML-based approaches for CI phases. Our systematic analysis also highlights the deficiencies of the existing ML-based solutions that can be improved for advancing the state-of-the-art. Given the variety of employed techniques in applying ML solutions in CI, and growing interest in this domain, it is necessary to systematically identify state-of-the-art practices used for automating CI tasks through ML methods. In recent years, the software development industry has seen a significant shift towards the adoption of Continuous Integration ...


Automating Machine Learning Pipelines with CI/CD/CT: A Guide to MLOps Best Practices

#artificialintelligence

MLOps, short for Machine Learning Operations, is an emerging practice that brings together the disciplines of machine learning and DevOps to streamline the entire lifecycle of machine learning models, from development to deployment and beyond. One of the key aspects of MLOps is the use of automation to improve the efficiency, reliability, and quality of machine learning pipelines. In this tutorial, we will explore how to use Continuous Integration (CI), Continuous Delivery (CD), and Continuous Testing (CT) to automate the deployment of machine learning models. These three concepts underpin MLOps automation, which typically involves a series of steps that automate the entire machine learning pipeline, from data preparation to model deployment. To automate this process, we can use a combination of CI/CD/CT tools and techniques.


Automate Model Deployment with GitHub Actions and AWS

#artificialintelligence

This article was published as a part of the Data Science Blogathon. In a typical software development process, deployment comes at the end of the software development life cycle. First, you build software, test it for possible faults, and finally deploy it so that end users can access it. The same can be applied to machine learning as well. In a previous article, I described how we could build a model, wrap it with a REST API, containerize it, and finally deploy it on cloud services.


MLOps & Machine Learning Pipeline Explained - Medi-AI

#artificialintelligence

MLOps is a compound term that combines "machine learning" and "operations." The role of MLOps, then, is to provide a communication conduit between data scientists who work with machine learning data and the operations team that manages the project. To do so, MLOps applies the type of cloud-native applications used in DevOps to machine learning (ML) services, specifically continuous integration/continuous deployment (CI/CD). Although both ML and normal cloud-native apps are written in (ok, result in) software, there is more to ML services than just code. While cloud-native apps require source version control, automated unit and load testing, A/B testing, and final deployment, MLOps uses a data pipeline, ML model training, and more complex deployment with special-purpose logging and monitoring capabilities.


5 ways machine learning uses CI/CD in production

#artificialintelligence

Continuous integration (CI) is the practice of all software developers merging their code changes into a central repository many times throughout the day. A fully automated software release process is called continuous delivery, abbreviated as CD. The two terms are not interchangeable, but together CI/CD forms a DevOps methodology. A continuous integration/continuous delivery (CI/CD) pipeline is a system that automates the software delivery process. CI/CD pipelines build code, run tests, and deliver new product versions when the software changes.
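
The build, test, and deliver sequence described above can be sketched as an ordered list of stages that stops at the first failure. This is a toy model for illustration only, not how any particular CI/CD product works; the names `run_pipeline`, `build`, `run_tests`, and `deliver` are hypothetical.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failing stage.

    Returns a log with one entry per stage that was actually executed.
    """
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # later stages (e.g. delivery) must not run after a failure
    return log


# Hypothetical stand-ins for real build, test, and delivery steps.
def build() -> bool:
    return True


def run_tests() -> bool:
    return 2 + 2 == 4


def deliver() -> bool:
    return True


log = run_pipeline([("build", build), ("test", run_tests), ("deliver", deliver)])
print(log)
```

Stopping at the first failure is the property that matters here: a change that fails its tests never reaches the delivery stage.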